Project Overview

Plagiarism Detection Project

In this project, you will build a plagiarism detector that examines a text file and performs binary classification, labeling that file as either plagiarized or not depending on how similar it is to a provided source text. Detecting plagiarism is an active area of research; the task is non-trivial, and the differences between paraphrased answers and original work are often not obvious.

Later in this lesson, you'll find a link to all of the relevant project files.

Defining Features

One way to detect plagiarism is to compute similarity features that measure how similar a given text file is to an original source text. You can develop as many features as you want, and you are required to define a couple as outlined in this paper (which is also linked in the Lesson Resources tab). In this paper, researchers created features called containment and longest common subsequence.

In the next few sections, which explain how these features are calculated, I'll refer to a submitted text file (the one we want to label as plagiarized or not) as a Student Answer Text and to an original Wikipedia source file (the one we want to compare that answer to) as the Wikipedia Source Text.
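As a rough preview, the two features could be sketched as follows. This is a minimal sketch under assumptions, not the project's reference implementation: it assumes containment is the share of the answer's word n-grams that also appear in the source, and that the longest common subsequence is counted in words and divided by the answer's length. The function names (ngrams, containment, lcs_norm_word) and the simple lowercase, whitespace-based tokenization are hypothetical choices made for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Count word n-grams in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def containment(answer_text, source_text, n=1):
    """Assumed definition: shared n-gram count / n-gram count of the answer."""
    answer_counts = ngrams(answer_text, n)
    source_counts = ngrams(source_text, n)
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((answer_counts & source_counts).values())
    return shared / total


def lcs_norm_word(answer_text, source_text):
    """Assumed definition: longest common word subsequence / answer length."""
    a = answer_text.lower().split()
    s = source_text.lower().split()
    if not a:
        return 0.0
    # prev[j] is the LCS length of the answer words seen so far vs. s[:j].
    prev = [0] * (len(s) + 1)
    for word in a:
        curr = [0]
        for j, src_word in enumerate(s, start=1):
            if word == src_word:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(a)


# Toy strings standing in for a Student Answer Text and a Wikipedia Source Text.
answer = "the cat sat on the mat"
source = "the cat lay on the mat"

c1 = containment(answer, source, n=1)
c2 = containment(answer, source, n=2)
lcs = lcs_norm_word(answer, source)
print(c1, c2, lcs)
```

Both values fall between 0 and 1, with higher values meaning the answer overlaps more with the source. Larger n makes containment stricter, since whole phrases must match rather than individual words.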

You'll be defining a few different similarity features to compare the two texts. Once you've extracted relevant features, it will be up to you to explore different classification models and decide on a model that gives you the best performance on a test dataset.
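The model-selection step can be pictured with a toy comparison. The sketch below is only an illustration under assumptions: the candidate "models" are simple threshold rules on a single containment score, standing in for whatever real classifiers you choose to try, and the feature values and labels are made up for the example rather than taken from the project data. The helper names (accuracy, make_threshold_model) are hypothetical.

```python
def accuracy(model, features, labels):
    """Fraction of examples where the model's prediction matches the label."""
    correct = sum(1 for x, y in zip(features, labels) if model(x) == y)
    return correct / len(labels)


def make_threshold_model(threshold):
    """Stand-in classifier: predict 1 (plagiarized) when the score is high enough."""
    return lambda containment_score: 1 if containment_score >= threshold else 0


# Made-up test set: one containment score per file, 1 = plagiarized, 0 = not.
test_features = [0.95, 0.80, 0.55, 0.40, 0.20, 0.10]
test_labels = [1, 1, 1, 0, 0, 0]

# Candidate models to compare on the same test data.
candidates = {f"threshold={t}": make_threshold_model(t) for t in (0.3, 0.5, 0.7)}
scores = {
    name: accuracy(model, test_features, test_labels)
    for name, model in candidates.items()
}

# Keep the candidate with the highest test accuracy.
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

With real classifiers the loop has the same shape: fit each candidate on training features, score each on the held-out test features, and keep the one that performs best.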